ABSTRACT

Small non-coding RNAs (ncRNAs) such as 
microRNAs, snoRNAs and tRNAs are a diverse 
collection of molecules with several important bio- 
logical functions. Current methods for high- 
throughput sequencing for the first time offer the 
opportunity to investigate the entire ncRNAome in 
an essentially unbiased way. However, there is a 
substantial need for methods that allow a conveni- 
ent analysis of these overwhelmingly large data 
sets. Here, we present DARIO, a free web service 
that allows users to study short read data from small
RNA-seq experiments. It provides a wide range of 
analysis features, including quality control, read nor- 
malization, ncRNA quantification and prediction of 
putative ncRNA candidates. The DARIO web site can 
be accessed at http://dario.bioinf.uni-leipzig.de/. 

INTRODUCTION 

High-throughput sequencing (HTS) using a small RNA 
preparation protocol (small RNA-seq) was primarily 
designed to measure the expression of microRNAs. 
Closer inspection of the resulting sequence libraries, 
however, revealed that many other ncRNA types are 
chopped into RNA molecules of microRNA-like length, 
and are hence detectable in the sequencing data as well 
(1). Some of the non-miRNA sources of short RNA se- 
quences include tRNAs (tRNA-derived fragments) (2-4), 
snoRNAs (snoRNA-derived small RNAs) (5), 21U-RNAs
(6) or snRNAs (1). Recently, small RNA sequencing has
helped to identify new RNA species such as microRNA 
offset RNAs (moRs), which derive from miRNA precursors.
Although they were first described in the simple
chordate Ciona intestinalis (7), they have also been verified in
mammalian transcriptomes (8) and were later
linked to Kaposi's sarcoma-associated herpesvirus (9,10).

Hence, small RNA-seq data contain a plethora of 
processing and maturation products potentially 
including yet unknown RNA species. Despite this fact, 
many small RNA-seq data analysis tools such as 
miRanalyzer (11), miRDeep (12) or miRNAkey (13) 
focus on microRNAs, largely neglecting other types of
RNAs. In addition, these programs are often restricted
to specific sequencing platforms due to embedded
mapping algorithms. Other tools such as deepBase do
not allow users to upload their own experimental data (14).

In addition to finding new RNA species, the expression 
levels of ncRNAs have been shown to be associated with 
a number of different phenotypes. Various forms of neo- 
plastic diseases such as colorectal cancer (15), for instance, 
show changes in miRNA expression levels. Likewise, dif- 
ferential snoRNA expression has been found in a study 
with meningioma cells (16). RNA quantification is possible
using tools such as rQuant.web (17) or RSEQTools (18); 
however, they are not readily applicable to small ncRNA 
analysis as annotation data must be collected from differ- 
ent sources. 

We have combined an ncRNA prediction method
(1,8) with tools to quantify ncRNAs in a completely
platform-independent and easy-to-use web tool. DARIO
performs RNA-seq quality controls and quantifies RNA
expression based on annotated ncRNAs from different
ncRNA databases. The expression data and ncRNA predictions
can be downloaded in the standardized BED
format. We provide a script to locally convert SAM files
and other mapping files to the BED format. The script is
optimized to greatly reduce the amount of data that has
to be uploaded to the DARIO server.

Figure 1. Simplified workflow of a DARIO computation. After the user upload, the data are run through quality checks with regard to read
length distributions and multiple mappings. Subsequently, the mapping loci are overlapped with ncRNA annotation data to measure gene
expression. A random forest classifier predicts new ncRNAs. The results of the analysis are easily accessible from a summary web page.

MATERIALS AND METHODS 

Workflow 

The DARIO web service requires previously mapped 
reads stored in compressed or uncompressed files in 
BAM or BED format. The uploaded file is uncompressed, 
if necessary, and examined for validity. A first analysis of 
the input data provides measures for quality control. 
The reads are then overlapped with various gene models 
of the selected species relevant for the analysis of 
small ncRNAs. Mapping loci overlapping with exonic 
regions are excluded from further analysis. Mapping loci 
overlapping with introns and intergenic regions are used 
to predict non-annotated ncRNAs. Finally, the results are 
summarized in HTML pages and data tables. A simplified 
workflow of the DARIO web service is depicted in 
Figure 1. 
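
The overlap step of this workflow can be illustrated with a minimal Python sketch. It is not DARIO's implementation: it assumes half-open, BED-style coordinates, ignores strand, and uses a hypothetical helper `classify_locus` together with toy exon and intron lists.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def classify_locus(chrom, start, end, exons, introns):
    """Label a mapping locus as 'exon', 'intron' or 'intergenic'.

    exons and introns map a chromosome name to a list of (start, end).
    Exon overlap takes precedence (an assumption of this sketch).
    """
    if any(overlaps(start, end, s, e) for s, e in exons.get(chrom, [])):
        return "exon"
    if any(overlaps(start, end, s, e) for s, e in introns.get(chrom, [])):
        return "intron"
    return "intergenic"


# Toy annotation with hypothetical coordinates.
exons = {"chr1": [(100, 200), (400, 500)]}
introns = {"chr1": [(200, 400)]}

loci = [("chr1", 150, 172), ("chr1", 250, 272), ("chr1", 900, 922)]
labels = [classify_locus(c, s, e, exons, introns) for c, s, e in loci]

# Exonic loci are dropped; intronic and intergenic loci remain as
# candidates for the prediction of non-annotated ncRNAs.
candidates = [locus for locus, label in zip(loci, labels) if label != "exon"]
```

A linear scan over all annotated intervals is enough for a toy example; for genome-scale annotation, an interval tree or a sorted sweep would be the usual choice.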



Sequence and annotation data 

Genome assemblies of six supported species were 
downloaded from the UCSC Genome Browser (http:// 
hgdownload.cse.ucsc.edu/downloads.html): Homo sapiens 
(hg18, NCBI 36.1 and hg19, GRCh37), Macaca mulatta
(rheMac2, MGSC Merged 1.0), Mus musculus (mm9, 
NCBI37), Danio rerio (danRer7, Zv9), Drosophila 
melanogaster (dm3, BDGP Release 5) and Caenorhabditis 
elegans (ce6, WUSTL School of Medicine GSC and 
Sanger Institute version WS190). For each assembly, we 
retrieved the UCSC Known Genes Track using the 
UCSC Table Browser in order to generate intron/exon 
lists. 

ncRNA annotation was collected from several databases.
While miRNA annotation was obtained from
miRBase v16 (19), most of the other ncRNA loci were
downloaded from the UCSC Genome Browser. For
human ncRNA data sets, we additionally included the
tRNA track (20), the wgRNA track (21) for snoRNAs and
the rnaGene track for other ncRNAs. For mouse, the
tRNA track was used. For fly, our annotation
encompasses the flyBaseNoncoding track from FlyBase
(22). The sangerRnaGene track containing WormBase
annotations (23) is provided for worm ncRNA data
analysis. Where necessary, annotations were lifted to
alternative assemblies with the UCSC tool liftOver
(http://hgdownload.cse.ucsc.edu/downloads.html).
Additional ncRNA annotations were collected from the 
Mouse Genome Database (24) as well as from Ensembl/ 
BioMart for zebrafish (25). If tRNA or snoRNA annota- 
tions were not available, we predicted candidates using 
tRNAscan-SE (26) or snoReport (27), respectively. 

Webserver implementation 

The web site and the HTML results are created by a set 
of Python scripts and the Mako template engine. 
The jobs are scheduled in a queued fashion and distributed 
over a set of active machines. Upon completion, the 
results are transferred to the web server and available 
under a personalized link for 4 weeks. Mapping loci are
merged into blocks based on their genomic positions and
assembled into regions of blocks using blockbuster v1.0
(8) with default parameters. These are then classified
using the random forest method in WEKA v3.6 
(1,28,29). Graphics are created using R (30) and the 
ggplot2 graphics package (31). RNAz Version 1.0 (32) 
has been used to screen all supported assemblies for 
potential functional RNA structures. Predicted ncRNA 
candidates are overlapped with these screenings to 
provide RNAz support. 

RESULTS AND DISCUSSION 

The DARIO web site provides a simple web form that 
allows the user to specify and upload input data. The 
web site currently supports seven assemblies of six 
species: human (hg18, hg19), rhesus monkey (rheMac2),
mouse (mm9), fruit fly (dm3), worm (ce6) and zebrafish 
(danRer6). After file upload, a job is created and queued 
for computation. The user may supply an email address to 
be notified upon job completion. A single job typically 
takes between 5 and 30 min. The results are summarized 
on a single web page containing job details, quality control 
measures and figures, ncRNA quantification and classifi- 
cation. All results can be downloaded for further analysis. 

Input format 

DARIO uses mapped sequences as input. The alignments 
may be provided in the common BAM or BED formats 
(http://genome.ucsc.edu/FAQ/FAQformat.html). The 
BED files must contain the fields for sequence identifier
and strand, and must provide the read count in the score
field. This format allows reads occurring multiple
times to be collapsed into unique sequence tags, dramatically
reducing the space requirements of sequencing data. DARIO
accepts gzipped or zipped files.
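
To illustrate how identical reads collapse into unique sequence tags, the following Python sketch builds six-column BED lines with the read count in the score field. The function name `collapse_to_bed`, the tag naming scheme and the example coordinates are assumptions made for illustration; the column conventions DARIO actually expects should be taken from the documentation on the web site.

```python
from collections import Counter


def collapse_to_bed(alignments):
    """Collapse alignments into BED lines, one per (tag, locus).

    alignments: iterable of (read_sequence, chrom, start, end, strand),
    with one entry per read and mapping locus.
    """
    # Identical reads at the same locus collapse into one counted entry.
    counts = Counter(alignments)
    tag_ids = {}
    lines = []
    ordered = sorted(
        counts.items(),
        key=lambda kv: (kv[0][1], kv[0][2], kv[0][3], kv[0][4], kv[0][0]),
    )
    for (seq, chrom, start, end, strand), n in ordered:
        # Each distinct sequence gets one identifier, reused across loci.
        tag = tag_ids.setdefault(seq, "tag%d" % (len(tag_ids) + 1))
        fields = [chrom, str(start), str(end), tag, str(n), strand]
        lines.append("\t".join(fields))
    return lines


# Hypothetical alignments: three identical reads and one singleton.
alignments = [
    ("TGAGGTAGTAGGTTGTATAGTT", "chr9", 1000, 1022, "+"),
    ("TGAGGTAGTAGGTTGTATAGTT", "chr9", 1000, 1022, "+"),
    ("TGAGGTAGTAGGTTGTATAGTT", "chr9", 1000, 1022, "+"),
    ("ACGTACGTACGTACGTACGT", "chr1", 50, 70, "-"),
]
bed_lines = collapse_to_bed(alignments)
```

Four alignment records become two BED lines here; on real libraries, where abundant miRNAs are sequenced many thousands of times, the reduction is far larger.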

We provide a small, dependency-free Perl script to convert
SAM and SOAP format files into the BED input
format. Virtually all common mapping tools (segemehl,
BWA, SOAP, Bowtie, etc.) can write their output alignments
to either of these formats.

Using genome loci of previously mapped reads, and 
thus decoupling read alignment and analysis, has a 
number of advantages over using raw sequence reads. 
First, DARIO does not depend on any sequencing
platform or mapping tool. Thus, read data originating
from any sequencing platform and aligned with any
mapping program can be used. Second, this greatly 
reduces the required amount of data to be uploaded to 
the server (e.g. 1GB SAM file -> 15MB compressed 
BED file). 

Quality control 

There are numerous errors and biases that can occur 
during sample handling, library preparation and 
sequencing in a small RNA-seq experiment, rendering an 
assessment of the experiment's quality a necessity (33-35).
A basic set of figures (Figure 2) gives the researcher a first 
impression of the quality of the experiment. This includes 
the read length distribution, the number and occurrence of 
multiply mapped reads, the fraction of reads mapping to
different genomic loci (exon, intron or intergenic) and 
ncRNA classes (miRNA, tRNA, snoRNA, etc.). Other 
measures include the number of mappable reads and the 
number of tags. 
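
As a rough illustration of these measures, the sketch below derives the number of tags, the number of mappable reads, a read length distribution and a multiple-mapping distribution from collapsed mapping loci. The function `qc_summary` and its input layout are hypothetical, and read length is approximated by the length of the mapping locus, which assumes gapless alignments.

```python
from collections import Counter


def qc_summary(loci):
    """Summarize basic quality measures.

    loci: list of (tag_id, start, end, read_count), one entry per
    mapping locus; a multiply mapped tag appears once per locus.
    """
    # Number of mapping loci per tag.
    n_loci = Counter(locus[0] for locus in loci)
    # One (length, read_count) record per tag, regardless of locus count.
    tag_info = {tag: (end - start, count) for tag, start, end, count in loci}

    length_dist = Counter()
    multimap_dist = Counter()
    for tag, (length, count) in tag_info.items():
        length_dist[length] += count          # reads per read length
        multimap_dist[n_loci[tag]] += count   # reads per number of loci

    return {
        "tags": len(tag_info),
        "mappable_reads": sum(count for _, count in tag_info.values()),
        "length_distribution": dict(length_dist),
        "multimap_distribution": dict(multimap_dist),
    }


# Hypothetical input: tag2 maps to two loci.
loci = [
    ("tag1", 1000, 1022, 30),
    ("tag2", 500, 521, 10),
    ("tag2", 9000, 9021, 10),
    ("tag3", 70, 92, 5),
]
summary = qc_summary(loci)
```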

RNA quantification 

For expression analysis, mapping loci are overlapped with 
annotated ncRNAs from a variety of sources. To handle 
multiple mappings, the number of reads for each sequence 
tag is divided by the number of its mapping loci. This 
normalized expression value is assigned to each mapping 
locus. These expression values are additionally normalized
to the total number of mappable reads (reads per million, RPM)
to allow subsequent differential expression analysis. Note
that these measures do not necessarily reflect precursor
ncRNA abundance, because RNA processing and the sequencing
protocol lead to a non-uniform read distribution across
the precursor RNA.
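
The two normalization steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DARIO code: the function `quantify`, the input layout and the annotation names are hypothetical, strand is ignored, and any overlap of at least one base counts as a hit.

```python
from collections import Counter


def quantify(loci, annotation):
    """Return {name: (raw_reads, multimap_normalized, rpm)}.

    loci: list of (tag_id, chrom, start, end, read_count), one entry
    per mapping locus.
    annotation: list of (name, chrom, start, end).
    """
    # Number of mapping loci per tag, used to split its read count.
    n_loci = Counter(locus[0] for locus in loci)
    # Each tag contributes its reads once to the mappable total.
    tag_counts = {tag: count for tag, _, _, _, count in loci}
    total_mappable = sum(tag_counts.values())

    result = {}
    for name, a_chrom, a_start, a_end in annotation:
        raw = 0
        norm = 0.0
        for tag, chrom, start, end, count in loci:
            if chrom == a_chrom and start < a_end and a_start < end:
                raw += count
                norm += count / n_loci[tag]   # multiple-mapping normalization
        rpm = norm / total_mappable * 1e6     # reads per million mappable reads
        result[name] = (raw, norm, rpm)
    return result


# Hypothetical data: tag2 has 10 reads and maps to two loci.
loci = [
    ("tag1", "chr9", 1000, 1022, 30),
    ("tag2", "chr1", 500, 521, 10),
    ("tag2", "chr5", 9000, 9021, 10),
]
annotation = [("mir-A", "chr9", 990, 1070), ("snoR-B", "chr1", 480, 560)]
expression = quantify(loci, annotation)
```

In this toy case, the 10 reads of `tag2` are split evenly between its two loci, so `snoR-B` receives a normalized count of 5 out of 40 mappable reads.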

A list of expressed ncRNAs, itemized by ncRNA 
classes, is generated (Figure 3). The user obtains informa- 
tion about the normalized expression, the number of 
mapped reads (raw and multimap normalized), as well 
as a link to the UCSC genome browser for each expressed 
locus. The UCSC link helps the experimenter to quickly 
scan the data for new types of ncRNAs, e.g. microRNA
offset RNAs (moRs) or vault RNAs, and to gain a deeper
understanding of the processing of these poorly understood
ncRNA classes.

The web interface allows users to upload their own annotation
tracks. The specified regions are included in all downstream
analyses. Predicted RNAs from previous DARIO
runs can be used directly as user annotation.

Classification 

DARIO predicts new ncRNAs using a previously pub- 
lished machine learning approach (1). This method relies 
on characteristic read patterns exhibited by different 
classes of ncRNA. The classifier achieves positive predictive
values (PPVs) and recall rates of 0.8. With recall rates
varying from 0.6 to 0.7 and PPVs between 0.7 and 0.8,
snoRNA predictions mark the lower bound of the classification
performance [cf. Table 2 in (1)]. Receiver operating characteristic
curves for all predicted ncRNAs in a number of species
are given in Supplementary Figure S1. For each candidate,
a prediction score is given along with an RNAz
classification (32), if available. One of the candidate
miRNAs predicted on human chromosome 8 using
the DARIO platform is shown in Figure 4. With the
links to the UCSC genome browser, it is possible to instantaneously
inspect the prediction by loading multiple
different annotation tracks.

Figure 2. The DARIO web server provides a set of graphics for quality control. The figures show the read length distribution, the number of
multiple mappings, the distribution of mapping loci across the genome and the annotated non-coding RNAs. The user may immediately check the
success of the short RNA sequencing run in terms of capturing the ncRNA of interest.

Figure 3. The DARIO analysis output is partitioned into different ncRNA classes. For each ncRNA class, a list that may be sorted by location,
name or expression criteria is provided. A link to the UCSC genome browser allows the instantaneous inspection of the ncRNAs, in this case a
snoRNA, including available ncRNA annotation tracks and conservation.

Figure 4. Example of a DARIO prediction for a miRNA. The integrated random forest classifier predicts a miRNA on human chromosome 8
in an intergenic region. The expression pattern shows a typical miR and miR* processing product constellation. Interestingly, the UCSC browser
reports neither annotations nor conservation at this position.



CONCLUSION 

HTS offers wide-ranging possibilities for analyzing 
ncRNAs in an unprecedented way. However, deciphering 
the world of non-coding RNAs in HTS data requires tools 
that allow integrated analysis in a user-friendly way. We 
have developed the first integrated tool for the analysis 
and prediction of various small ncRNAs on user-provided 
RNA-seq data. The web service allows researchers to 
quickly grasp and assess the success of a short RNA-seq 
experiment. The web server overlaps the mapping loci 
with ncRNA genes from a number of ncRNA classes 
and annotation databases in order to quantify RNA abun- 
dance with different expression measures. Reads that 
do not map to annotated ncRNA genes are identified 
and classified. DARIO provides an easy to use web inter- 
face and thus greatly facilitates both initial evaluation and 
downstream analysis of read data originating from arbitrary
sequencing platforms. Future versions of DARIO
will allow direct comparison of sets of small RNA transcriptomes
to evaluate differences in expression levels of
ncRNAs.

Availability and requirements 

DARIO can be accessed freely via the web browser using 
the URL http://dario.bioinf.uni-leipzig.de/. There are no
restrictions on use and no login requirement. It has been
tested with several browsers and works with Safari, 
Firefox and Internet Explorer. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 